This project explores each variable in a red wine dataset to find out which variables most strongly influence the quality of red wine.
The data I am going to use is a dataset on red wine.
The data was collected in 2009 by Paulo Cortez, Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis to explore the relationship between the quality of wine and its chemical properties.
I was interested in exploring this dataset since I love drinking wine and was wondering which factor has the most impact on the quality of red wine.
Through plots and analysis, I hope to find some of the factors that help explain the quality of red wine.
## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
As shown above, the data has 1599 observations and 13 columns. The first column, X, is only a row index, which leaves 12 variables.
I will now look at the distribution of each of the 12 variables.
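The inspection step above can be sketched as follows. This is a minimal sketch, not the original code: the data frame name `wine` is taken from the `aov` calls printed later in this report, and because the file name is not shown here, the sketch builds a ten-row stand-in from the first values printed by `str()` instead of reading the full file.

```r
# Ten-row stand-in built from the first values printed by str() above.
# The real analysis loads the full 1599-row file instead.
wine <- data.frame(
  X                = 1:10,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine)      # number of rows and columns
names(wine)    # column names
str(wine)      # type and first values of each column
summary(wine)  # quartiles, mean, min and max of each column

# X only numbers the rows, so it is dropped before the analysis
wine$X <- NULL
```

Dropping `X` early keeps it out of later summaries and correlation tables, where a row index would only add noise.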
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The histogram of fixed.acidity shows that the distribution is roughly normal, with a peak around 7.8. There is a suspected outlier on the right, and I should consider whether to exclude it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4932 0.7306 0.8041 0.7977 0.8618 1.1647
The histogram of volatile.acidity^(1/3) shows that the transformed distribution is roughly normal, with a peak around 0.85. There is a suspected outlier on the right, and I should consider whether to exclude it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The histogram of sqrt(citric.acid) shows that the transformed distribution is roughly normal, with peaks around 0.5 and 0.75. There is another peak at 0: the square root of 0 is 0, so the transformation has no effect on wines with no citric acid.
## TRUE
## 132
There are 132 wines that have 0 citric acid.
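As a small illustration of why the zeros survive the transformation, the sketch below applies `sqrt()` to the first ten citric acid values printed by `str()` above and counts the zeros. The variable names are chosen for this example only, and the count is for the ten-row sample, not the full dataset.

```r
# First ten citric acid values from the str() output above
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Square-root transform: pulls in the right tail, but sqrt(0) stays 0
ca_sqrt <- sqrt(citric.acid)
ca_sqrt

# Count how many wines in this sample have no citric acid
n_zero <- sum(citric.acid == 0)
n_zero
```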
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The histogram of log10(residual.sugar) shows that the transformed distribution is roughly normal, with peaks around 0.3 and 0.4. There is a suspected outlier on the right, and I should consider whether to exclude it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The histogram of log10(chlorides) shows that the transformed distribution is roughly normal, with peaks around -1.1 and -1.2. There are suspected outliers on both sides, and I should consider whether to exclude them.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The histogram of sqrt(free.sulfur.dioxide) shows that the transformed distribution is skewed to the right, with peaks around 2.5 and 4. There is a suspected outlier on the right, and I should consider whether to exclude it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The histogram of log10(total.sulfur.dioxide) shows that the transformed distribution is roughly normal, with peaks around 1.5 and 1.75. There is a suspected outlier on the right, and I should consider whether to exclude it.
Total sulfur dioxide is made up of a free form and a bound form. I will create another variable for bound sulfur dioxide by subtracting free sulfur dioxide from total sulfur dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 12.00 21.00 30.59 39.00 251.50
The histogram of log10(bound.sulfur.dioxide) shows that the transformed distribution is roughly normal, with peaks around 1 and 1.5. There is a suspected outlier on the right, and I should consider whether to exclude it.
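The derived variable is a plain column subtraction. The sketch below uses the first ten free and total values printed by `str()` earlier; the result matches the first ten `bound.sulfur.dioxide` values that appear in the later `str()` output. The standalone vectors stand in for the columns of the full data frame.

```r
# First ten values of each column from the str() output above
free.sulfur.dioxide  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total.sulfur.dioxide <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Bound sulfur dioxide is whatever is left of the total after removing the free form
bound.sulfur.dioxide <- total.sulfur.dioxide - free.sulfur.dioxide
bound.sulfur.dioxide

# Summary on the log10 scale used for the histogram
summary(log10(bound.sulfur.dioxide))
```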
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The histogram of density shows that the distribution is roughly normal, with a peak around 0.997.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The histogram of pH shows that the distribution is roughly normal, with peaks around 3.25 and 3.3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The histogram of log10(sulphates) shows that the transformed distribution is close to normal with a tail to the right, and a peak around -0.2. There is a suspected outlier on the right, and I should consider whether to exclude it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The histogram of alcohol shows that the distribution is skewed to the right, with a peak around 9.5. There is a suspected outlier on the right, and I should consider whether to exclude it.
I will create another variable by dividing alcohol into 5 categories.
##
## very low low medium high very high
## 552 639 304 96 8
From the table and plot, we can see that most wines have an alcohol level below 11.
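The break points used for the five alcohol categories are not shown in this report. As a sketch under a stated assumption, the code below uses five equal-width bins over the full-data range (minimum 8.4, maximum 14.9, taken from the summary above), which is roughly what `cut(x, 5)` would produce on the full column. That assumption reproduces the category codes printed by `str()` for the first ten wines, but the original breaks may differ.

```r
# First ten alcohol values from the str() output above
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# ASSUMPTION: five equal-width bins between the full-data min (8.4) and max (14.9)
breaks <- seq(8.4, 14.9, length.out = 6)

alcohol_lev <- cut(
  alcohol,
  breaks = breaks,
  labels = c("very low", "low", "medium", "high", "very high"),
  include.lowest = TRUE  # keep the minimum value inside the first bin
)

table(alcohol_lev)
```

With these breaks the boundary between "low" and "medium" falls at 11, which is consistent with the observation that most wines sit below 11.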
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
From the table above, we can see that most wines received a quality score of 5 or 6.
I will create two more variables by dividing quality into three categories (low, medium, high) and into two categories (low, high).
##
## low medium high
## 63 1319 217
From the table, we can see that most red wines fall into the medium category, which covers wines with a quality of 5 or 6.
##
## low high
## 744 855
From the table, we can see that quality is almost evenly divided into low and high.
There are 1599 red wines in the dataset with 16 variables, including the four that I created.
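The two groupings can be sketched with `cut()`. The cut points follow the groupings described in this report: low (3-4), medium (5-6), high (7-8) for the three-level variable, and low (3-5), high (6-8) for the two-level one. The exact call used in the original is not shown, so treat this as one possible way to get the same categories.

```r
# First ten quality scores from the str() output above
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Three levels: (2,4] = low, (4,6] = medium, (6,8] = high
quality_3 <- cut(quality, breaks = c(2, 4, 6, 8),
                 labels = c("low", "medium", "high"))

# Two levels: (2,5] = low, (5,8] = high
quality_2 <- cut(quality, breaks = c(2, 5, 8),
                 labels = c("low", "high"))

table(quality_3)
table(quality_2)
```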
## 'data.frame': 1599 obs. of 16 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ bound.sulfur.dioxide: num 23 42 39 43 23 27 44 6 9 85 ...
## $ alcohol_lev : Factor w/ 5 levels "very low","low",..: 1 2 2 2 1 1 1 2 1 2 ...
## $ quality_3 : Factor w/ 3 levels "low","medium",..: 2 2 2 2 2 2 2 3 3 2 ...
## $ quality_2 : Factor w/ 2 levels "low","high": 1 1 1 2 1 1 1 2 2 1 ...
I added the bound sulfur dioxide variable because the bound form might be the one that influences the quality of the wine.
I also had to transform several variables to make their distributions closer to normal. Most of the graphs were skewed to the right, so I used the log10 and sqrt functions to reduce the skew.
In the next section, I will explore which variables are best for predicting the quality of red wine.
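To illustrate how a log transform reduces right skew, the sketch below compares a simple moment-based skewness measure before and after `log10()` on the first ten residual sugar values printed earlier. The `skewness()` helper is defined here for the example and is not a base R function; ten rows are far too few to judge the shape of the full distribution.

```r
# First ten residual sugar values from the str() output above
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Moment-based skewness: positive values indicate a longer right tail.
# This helper is defined for this sketch only.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(residual.sugar)
skew_log <- skewness(log10(residual.sugar))

# The log10 scale compresses large values, so the skew should shrink
c(raw = skew_raw, log10 = skew_log)
```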
From the above, citric acid, alcohol, and sulphates have the highest correlations with the quality of red wine.
From the table above, there might be a multicollinearity problem when we look at multivariate relationships. I will take this into account in the next section on multivariate relationships.
From the table above, it is hard to tell whether there is a linear relationship between the variables and the quality of red wine, especially since quality behaves more like a categorical variable.
I will look into this more closely through graphs and ANOVA tests.
From the scatter plot above, it is hard to see the relationship between red wine quality and fixed acidity, especially since wine quality is a categorical variable. I will use box plots to examine the relationship, and I will convert quality from numeric to a factor to make them.
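A correlation check of this kind can be sketched with `cor()`. The code below uses the first ten rows printed by `str()` for four of the variables, so the numbers it produces are not the ones discussed above. The Spearman version is included as an assumption on my part, since rank correlation is a common choice when the outcome is an ordinal score; the report does not say which method was used.

```r
# Ten-row stand-in from the str() output above (not the full dataset)
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

predictors <- setdiff(names(wine), "quality")

# Pearson correlation of each predictor with quality
r_pearson <- sapply(wine[predictors], cor, y = wine$quality)

# Spearman rank correlation, an alternative for an ordinal outcome
r_spearman <- sapply(wine[predictors], cor, y = wine$quality,
                     method = "spearman")

round(rbind(pearson = r_pearson, spearman = r_spearman), 2)

# Correlations among the predictors themselves; large absolute values
# here are a warning sign for multicollinearity
round(cor(wine[predictors]), 2)
```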
## 'data.frame': 1599 obs. of 16 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## $ bound.sulfur.dioxide: num 23 42 39 43 23 27 44 6 9 85 ...
## $ alcohol_lev : Factor w/ 5 levels "very low","low",..: 1 2 2 2 1 1 1 2 1 2 ...
## $ quality_3 : Factor w/ 3 levels "low","medium",..: 2 2 2 2 2 2 2 3 3 2 ...
## $ quality_2 : Factor w/ 2 levels "low","high": 1 1 1 2 1 1 1 2 2 1 ...
From the boxplots above, we do not see a strong relationship between quality and fixed acidity.
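The conversion of quality to a factor can be sketched as below, again on the first ten rows. Passing `levels = 3:8` is needed only because the ten-row sample does not contain every score; on the full data a plain `factor()` call would give the same six levels. The per-level medians are the centre lines a boxplot would draw.

```r
# Ten-row stand-in from the str() output above
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Convert quality from integer to a factor with the six observed scores
wine$quality <- factor(wine$quality, levels = 3:8)
str(wine$quality)

# Median fixed acidity per quality level (NA where the sample has no wines)
med <- tapply(wine$fixed.acidity, wine$quality, median)
med
```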
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 94 18.737 6.283 8.79e-06 ***
## Residuals 1593 4751 2.982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table lets us reject the null hypothesis of no difference, so there is a significant relationship between quality and fixed acidity. However, the ANOVA table does not tell us which quality groups differ from each other. I will look further into which ones differ with a post hoc test.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = fixed.acidity ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 -0.58075472 -2.27948640 1.1179770 0.9257629
## 5-3 -0.19274596 -1.76223366 1.3767417 0.9993075
## 6-3 -0.01282132 -1.58307424 1.5574316 1.0000000
## 7-3 0.51236181 -1.08439601 2.1091196 0.9426320
## 8-3 0.20666667 -1.73661257 2.1499459 0.9996570
## 5-4 0.38800876 -0.31462496 1.0906425 0.6148684
## 6-4 0.56793340 -0.13640797 1.2722748 0.1942859
## 7-4 1.09311653 0.33151423 1.8547188 0.0006306
## 8-4 0.78742138 -0.55672768 2.1315705 0.5509949
## 6-5 0.17992465 -0.09155105 0.4514003 0.4080237
## 7-5 0.70510777 0.30806829 1.1021472 0.0000067
## 8-5 0.39941263 -0.77716674 1.5759920 0.9278394
## 7-6 0.52518313 0.12512942 0.9252368 0.0025626
## 8-6 0.21948798 -0.95811196 1.3970879 0.9948930
## 8-7 -0.30569514 -1.51841231 0.9070220 0.9796484
From the post hoc comparison above, there is a significant difference for the pairs 7-4, 7-5 and 7-6.
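The ANOVA and post hoc steps correspond to `aov()` and `TukeyHSD()`, as the printed `Fit:` line and Tukey header indicate. The sketch below runs both on the ten-row stand-in, so only quality levels 5, 6 and 7 are present and the results are not comparable to the full-data tables above. The last lines pick out the pairs whose adjusted p-value is below 0.05, which is how the significant pairs can be read off without scanning the table by eye.

```r
# Ten-row stand-in from the str() output above (only scores 5, 6 and 7 occur)
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# One-way ANOVA: does mean fixed acidity differ across quality levels?
fit <- aov(fixed.acidity ~ quality, data = wine)
summary(fit)

# Tukey's honest significant difference test for all pairs of levels
tukey <- TukeyHSD(fit)
tukey

# Keep only the pairs with an adjusted p-value below 0.05
pairs <- tukey$quality
sig_pairs <- rownames(pairs)[pairs[, "p adj"] < 0.05]
sig_pairs
```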
The boxplots above suggest a relationship between volatile acidity and quality: volatile acidity seems to decrease as the quality of the wine increases.
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 8.22 1.645 60.91 <2e-16 ***
## Residuals 1593 43.01 0.027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table lets us reject the null hypothesis of no difference, so there is a significant relationship between quality and volatile acidity. However, the ANOVA table does not tell us which quality groups differ from each other. I will look further into which ones differ with a post hoc test.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = volatile.acidity ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 -0.19053774 -0.35217798 -0.02889749 0.0102247
## 5-3 -0.30745888 -0.45680111 -0.15811665 0.0000001
## 6-3 -0.38701567 -0.53643072 -0.23760063 0.0000000
## 7-3 -0.48058040 -0.63251748 -0.32864332 0.0000000
## 8-3 -0.46116667 -0.64607647 -0.27625687 0.0000000
## 5-4 -0.11692115 -0.18377920 -0.05006310 0.0000099
## 6-4 -0.19647794 -0.26349848 -0.12945740 0.0000000
## 7-4 -0.29004267 -0.36251178 -0.21757355 0.0000000
## 8-4 -0.27062893 -0.39852940 -0.14272846 0.0000000
## 6-5 -0.07955679 -0.10538865 -0.05372493 0.0000000
## 7-5 -0.17312152 -0.21090121 -0.13534183 0.0000000
## 8-5 -0.15370778 -0.26566341 -0.04175215 0.0013080
## 7-6 -0.09356473 -0.13163123 -0.05549822 0.0000000
## 8-6 -0.07415099 -0.18620374 0.03790175 0.4098254
## 8-7 0.01941374 -0.09598053 0.13480800 0.9968509
From the post hoc comparison above, there is a significant difference between every pair of quality levels except 8-6 and 8-7. Given this, I am considering grouping quality into three levels, low (3-4), medium (5-6), and high (7-8), to better explain the relationship between volatile acidity and quality.
The boxplots above suggest a relationship between citric acid and quality: citric acid seems to increase as the quality of the wine increases.
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 3.53 0.7059 19.69 <2e-16 ***
## Residuals 1593 57.11 0.0359
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table lets us reject the null hypothesis of no difference, so there is a significant relationship between quality and citric acid. However, the ANOVA table does not tell us which quality groups differ from each other. I will look further into which ones differ with a post hoc test.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = citric.acid ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 0.003150943 -0.1831058063 0.18940769 1.0000000
## 5-3 0.072685756 -0.0994000884 0.24477160 0.8345084
## 6-3 0.102824451 -0.0693452965 0.27499420 0.5292715
## 7-3 0.204175879 0.0291000127 0.37925175 0.0115446
## 8-3 0.220111111 0.0070410437 0.43318118 0.0381644
## 5-4 0.069534813 -0.0075051774 0.14657480 0.1039655
## 6-4 0.099673508 0.0224462830 0.17690073 0.0032561
## 7-4 0.201024936 0.1175193597 0.28453051 0.0000000
## 8-4 0.216960168 0.0695814861 0.36433885 0.0004036
## 6-5 0.030138695 0.0003728525 0.05990454 0.0451915
## 7-5 0.131490123 0.0879568901 0.17502336 0.0000000
## 8-5 0.147425355 0.0184197856 0.27643092 0.0144221
## 7-6 0.101351428 0.0574877008 0.14521516 0.0000000
## 8-6 0.117286660 -0.0118308104 0.24640413 0.0998116
## 8-7 0.015935232 -0.1170326519 0.14890312 0.9993852
From the post hoc comparison above, there is a significant difference for the pairs 7-3, 8-3, 6-4, 7-4, 8-4, 6-5, 7-5, 8-5, and 7-6.
From the boxplots above, we do not see a strong relationship between quality and residual sugar.
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 10 2.094 1.053 0.385
## Residuals 1593 3166 1.988
From the ANOVA table, we cannot reject the null hypothesis that there is no significant relationship between quality and residual sugar.
From the boxplots above, we do not see a strong relationship between quality and chlorides.
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 0.066 0.013162 6.036 1.53e-05 ***
## Residuals 1593 3.474 0.002181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table lets us reject the null hypothesis of no difference, so there is a significant relationship between quality and chlorides. However, the ANOVA table does not tell us which quality groups differ from each other. I will look further into which ones differ with a post hoc test.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = chlorides ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 -0.031820755 -0.07775835 0.0141168441 0.3563775
## 5-3 -0.029764317 -0.07220686 0.0126782279 0.3421496
## 6-3 -0.037543887 -0.08000713 0.0049193515 0.1180933
## 7-3 -0.045912060 -0.08909205 -0.0027320685 0.0295304
## 8-3 -0.054055556 -0.10660628 -0.0015048304 0.0395900
## 5-4 0.002056438 -0.01694439 0.0210572639 0.9996262
## 6-4 -0.005723132 -0.02477014 0.0133238728 0.9563871
## 7-4 -0.014091306 -0.03468678 0.0065041663 0.3707314
## 8-4 -0.022234801 -0.05858367 0.0141140711 0.5018527
## 6-5 -0.007779570 -0.01512090 -0.0004382449 0.0304543
## 7-5 -0.016147743 -0.02688460 -0.0054108855 0.0002720
## 8-5 -0.024291238 -0.05610864 0.0075261647 0.2484623
## 7-6 -0.008368173 -0.01918654 0.0024501961 0.2349638
## 8-6 -0.016511668 -0.04835667 0.0153333334 0.6775878
## 8-7 -0.008143495 -0.04093815 0.0246511567 0.9809645
From the post hoc comparison above, there is a significant difference for the pairs 7-3, 8-3, 6-5, and 7-5.
From the boxplots above, we do not see a strong relationship between quality and free sulfur dioxide.
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 2571 514.1 4.754 0.000257 ***
## Residuals 1593 172274 108.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table lets us reject the null hypothesis of no difference, so there is a significant relationship between quality and free sulfur dioxide. However, the ANOVA table does not tell us which quality groups differ from each other. I will look further into which ones differ with a post hoc test.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = free.sulfur.dioxide ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 1.2641509 -8.9655860 11.4938879 0.9992862
## 5-3 5.9838473 -3.4675842 15.4352788 0.4618281
## 6-3 4.7115987 -4.7444410 14.1676385 0.7138656
## 7-3 3.0452261 -6.5704257 12.6608780 0.9456583
## 8-3 2.2777778 -9.4246209 13.9801764 0.9937451
## 5-4 4.7196963 0.4884466 8.9509461 0.0185784
## 6-4 3.4474478 -0.7940854 7.6889810 0.1868980
## 7-4 1.7810752 -2.8052825 6.3674329 0.8782125
## 8-4 1.0136268 -7.0808188 9.1080725 0.9992387
## 6-5 -1.2722485 -2.9070711 0.3625740 0.2288173
## 7-5 -2.9386212 -5.3295870 -0.5476553 0.0061996
## 8-5 -3.7060695 -10.7914129 3.3792739 0.6692481
## 7-6 -1.6663726 -4.0754901 0.7427448 0.3580539
## 8-6 -2.4338210 -9.5253103 4.6576683 0.9246011
## 8-7 -0.7674484 -8.0704130 6.5355163 0.9996765
From the post hoc comparison above, there is a significant difference for the pairs 5-4 and 7-5.
From the boxplots above, we do not see a strong relationship between quality and bound sulfur dioxide.
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 98706 19741 29.36 <2e-16 ***
## Residuals 1593 1071097 672
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table lets us reject the null hypothesis of no difference, so there is a significant relationship between quality and bound sulfur dioxide. However, the ANOVA table does not tell us which quality groups differ from each other. I will look further into which ones differ with a post hoc test.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = bound.sulfur.dioxide ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 10.0811321 -15.426418 35.588682 0.8699705
## 5-3 25.6301028 2.063234 49.196971 0.0238790
## 6-3 11.2583072 -12.320052 34.836666 0.7496028
## 7-3 7.0748744 -16.901472 31.051221 0.9596104
## 8-3 6.2666667 -22.912922 35.446256 0.9901376
## 5-4 15.5489707 4.998473 26.099468 0.0003955
## 6-4 1.1771751 -9.398964 11.753314 0.9995713
## 7-4 -3.0062577 -14.442207 8.429691 0.9754993
## 8-4 -3.8144654 -23.997729 16.368798 0.9945496
## 6-5 -14.3717956 -18.448178 -10.295413 0.0000000
## 7-5 -18.5552284 -24.517032 -12.593425 0.0000000
## 8-5 -19.3634361 -37.030533 -1.696339 0.0221440
## 7-6 -4.1834328 -10.190497 1.823631 0.3501748
## 8-6 -4.9916405 -22.674062 12.690781 0.9665845
## 8-7 -0.8082077 -19.017937 17.401521 0.9999955
From the post hoc comparison above, there is a significant difference for the pairs 5-3, 5-4, 6-5, 7-5 and 8-5.
From the boxplots above, we do not see a strong relationship between quality and total sulfur dioxide.
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 128045 25609 25.48 <2e-16 ***
## Residuals 1593 1601155 1005
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table lets us reject the null hypothesis of no difference, so there is a significant relationship between quality and total sulfur dioxide. However, the ANOVA table does not tell us which quality groups differ from each other. I will look further into which ones differ with a post hoc test.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = total.sulfur.dioxide ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 11.345283 -19.841526 42.532092 0.9051128
## 5-3 31.613950 2.799916 60.427984 0.0219162
## 6-3 15.969906 -12.858177 44.797989 0.6115609
## 7-3 10.120101 -19.194583 39.434784 0.9228115
## 8-3 8.544444 -27.131983 44.220872 0.9838108
## 5-4 20.268667 7.369100 33.168234 0.0001149
## 6-4 4.624623 -8.306295 17.555541 0.9112220
## 7-4 -1.225183 -15.207347 12.756982 0.9998676
## 8-4 -2.800839 -27.477908 21.876231 0.9995284
## 6-5 -15.644044 -20.628033 -10.660055 0.0000000
## 7-5 -21.493850 -28.783049 -14.204650 0.0000000
## 8-5 -23.069506 -44.670183 -1.468828 0.0283464
## 7-6 -5.849805 -13.194343 1.494732 0.2059503
## 8-6 -7.425462 -29.044876 14.193953 0.9243726
## 8-7 -1.575656 -23.839783 20.688471 0.9999539
From the post hoc comparison above, there is a significant difference for the pairs 5-3, 5-4, 6-5, 7-5 and 8-5.
From the boxplots above, we do not see a strong relationship between quality and density.
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 0.000230 4.594e-05 13.4 8.12e-13 ***
## Residuals 1593 0.005462 3.430e-06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table lets us reject the null hypothesis of no difference, so there is a significant relationship between quality and density. However, the ANOVA table does not tell us which quality groups differ from each other. I will look further into which ones differ with a post hoc test.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = density ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 -9.215472e-04 -0.0027431246 9.000302e-04 0.7003175
## 5-3 -3.603730e-04 -0.0020433600 1.322614e-03 0.9902708
## 6-3 -8.489373e-04 -0.0025327449 8.348703e-04 0.7033996
## 7-3 -1.359729e-03 -0.0030719578 3.525005e-04 0.2088099
## 8-3 -2.251778e-03 -0.0043355875 -1.679681e-04 0.0253891
## 5-4 5.611742e-04 -0.0001922713 1.314620e-03 0.2747470
## 6-4 7.260987e-05 -0.0006826668 8.278865e-04 0.9997910
## 7-4 -4.381815e-04 -0.0012548599 3.784970e-04 0.6443084
## 8-4 -1.330231e-03 -0.0027715834 1.111221e-04 0.0899646
## 6-5 -4.885643e-04 -0.0007796721 -1.974566e-04 0.0000271
## 7-5 -9.993557e-04 -0.0014251075 -5.736038e-04 0.0000000
## 8-5 -1.891405e-03 -0.0031530698 -6.297397e-04 0.0002889
## 7-6 -5.107913e-04 -0.0009397754 -8.180729e-05 0.0090996
## 8-6 -1.402840e-03 -0.0026655999 -1.400810e-04 0.0193569
## 8-7 -8.920491e-04 -0.0021924653 4.083671e-04 0.3677080
From the post hoc comparison above, there is a significant difference for the pairs 8-3, 6-5, 7-5, 8-5, 7-6 and 8-6.
From the boxplots above, it seems that pH decreases as quality increases. However, we should not conclude anything yet.
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 0.51 0.10242 4.342 0.000628 ***
## Residuals 1593 37.58 0.02359
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table lets us reject the null hypothesis of no difference, so there is a significant relationship between quality and pH. However, the ANOVA table does not tell us which quality groups differ from each other. I will look further into which ones differ with a post hoc test.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = pH ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 -0.01649057 -0.16757254 0.1345914093 0.9996104
## 5-3 -0.09305140 -0.23263865 0.0465358649 0.4012091
## 6-3 -0.07992790 -0.21958322 0.0597274183 0.5767071
## 7-3 -0.10724623 -0.24925884 0.0347663821 0.2599190
## 8-3 -0.13077778 -0.30360935 0.0420537932 0.2578317
## 5-4 -0.07656083 -0.13905174 -0.0140699183 0.0064502
## 6-4 -0.06343733 -0.12608012 -0.0007945477 0.0451336
## 7-4 -0.09075567 -0.15849113 -0.0230202009 0.0019007
## 8-4 -0.11428721 -0.23383328 0.0052588577 0.0704301
## 6-5 0.01312350 -0.01102104 0.0372680287 0.6312170
## 7-5 -0.01419484 -0.04950677 0.0211171021 0.8615725
## 8-5 -0.03772638 -0.14236912 0.0669163551 0.9083845
## 7-6 -0.02731833 -0.06289835 0.0082616867 0.2425756
## 8-6 -0.05084988 -0.15558338 0.0538836280 0.7359924
## 8-7 -0.02353155 -0.13138831 0.0843252185 0.9893982
From the post hoc comparison above, there is a significant difference for the pairs 5-4, 6-4, and 7-4.
From the boxplots above, it seems that the sulphates level increases as quality increases. However, we should not conclude anything yet.
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 3.00 0.6000 22.27 <2e-16 ***
## Residuals 1593 42.91 0.0269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table lets us reject the null hypothesis of no difference, so there is a significant relationship between quality and sulphates. However, the ANOVA table does not tell us which quality groups differ from each other. I will look further into which ones differ with a post hoc test.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = sulphates ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 0.02641509 -0.13504180 0.18787198 0.9972425
## 5-3 0.05096916 -0.09820366 0.20014199 0.9259342
## 6-3 0.10532915 -0.04391640 0.25457471 0.3348774
## 7-3 0.17125628 0.01949155 0.32302101 0.0164864
## 8-3 0.19777778 0.01307773 0.38247782 0.0276634
## 5-4 0.02455407 -0.04222814 0.09133628 0.9011170
## 6-4 0.07891406 0.01196955 0.14585857 0.0102225
## 7-4 0.14484119 0.07245428 0.21722810 0.0000002
## 8-4 0.17136268 0.04360729 0.29911807 0.0018695
## 6-5 0.05435999 0.02855743 0.08016255 0.0000000
## 7-5 0.12028712 0.08255028 0.15802395 0.0000000
## 8-5 0.14680861 0.03497998 0.25863725 0.0025621
## 7-6 0.06592713 0.02790380 0.10395045 0.0000123
## 8-6 0.09244862 -0.01947701 0.20437426 0.1723998
## 8-7 0.02652150 -0.08874188 0.14178487 0.9864895
From the post hoc comparison above, there is a significant difference for the pairs 7-3, 8-3, 6-4, 7-4, 8-4, 6-5, 7-5, 8-5, and 7-6.
From the boxplots above, it seems that the alcohol level increases as quality increases. However, we should not conclude anything yet.
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 483.9 96.79 115.9 <2e-16 ***
## Residuals 1593 1330.8 0.84
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table lets us reject the null hypothesis of no difference, so there is a significant relationship between quality and alcohol. However, the ANOVA table does not tell us which quality groups differ from each other. I will look further into which ones differ with a post hoc test.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = alcohol ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 0.31009434 -0.589020145 1.209208824 0.9231095
## 5-3 -0.05529369 -0.886001167 0.775413796 0.9999660
## 6-3 0.67451933 -0.156593176 1.505631838 0.1882542
## 7-3 1.51091290 0.665771726 2.356054069 0.0000056
## 8-3 2.13944444 1.110894424 3.167994465 0.0000001
## 5-4 -0.36538803 -0.737282044 0.006505993 0.0574326
## 6-4 0.36442499 -0.008372862 0.737222846 0.0597032
## 7-4 1.20081856 0.797713311 1.603923806 0.0000000
## 8-4 1.82935010 1.117911150 2.540789059 0.0000000
## 6-5 0.72981302 0.586124800 0.873501234 0.0000000
## 7-5 1.56620658 1.356059244 1.776353923 0.0000000
## 8-5 2.19473813 1.571991432 2.817484828 0.0000000
## 7-6 0.83639357 0.624650838 1.048136295 0.0000000
## 8-6 1.46492511 0.841638238 2.088211988 0.0000000
## 8-7 0.62853155 -0.013342374 1.270405467 0.0589299
From the post hoc comparison above, there is a significant difference for the pairs 7-3, 8-3, 7-4, 8-4, 6-5, 7-5, 8-5, 7-6 and 8-6.
F-value
- Quality x Fixed Acidity: 6.283
- Quality x Volatile Acidity: 60.91
- Quality x Citric Acid: 19.69
- Quality x Residual Sugar: 1.053
- Quality x Chlorides: 6.036
- Quality x Free Sulfur Dioxide: 4.754
- Quality x Bound Sulfur Dioxide: 29.36
- Quality x Total Sulfur Dioxide: 25.48
- Quality x Density: 13.4
- Quality x pH: 4.342
- Quality x Sulphates: 22.27
- Quality x Alcohol: 115.9
Only residual sugar has an F-value too low to reject the null hypothesis. Alcohol, volatile acidity, bound sulfur dioxide, sulphates, and citric acid had high F-values, so I will use these variables to investigate the relationship further.
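Rather than copying each F-value by hand, the comparison above can be sketched as a loop over variables. This is an illustration on the ten-row stand-in, so the ranking it prints will not match the full-data values listed above; `reformulate()` is used to build each `variable ~ quality` formula from the column name.

```r
# Ten-row stand-in from the str() output above (not the full dataset)
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

vars <- setdiff(names(wine), "quality")

# Fit one ANOVA per variable and pull the F statistic for the quality term
f_values <- sapply(vars, function(v) {
  fit <- aov(reformulate("quality", response = v), data = wine)
  summary(fit)[[1]][["F value"]][1]
})

# Larger F means the group means differ more relative to within-group spread
sort(f_values, decreasing = TRUE)
```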
From observation of bivariate plot, I noticed that it will better to reorganize quality into three categories as low, medium, and high.
From looking at the graph, high level quality red wines tends to have higher alcohol level compared to other wines. However, other than the alcohol level, it is hard to notice any strong relationship. Instead of dividing quality into low, medium, and high, I think a variable that divided quality into low(3~5) and high(6~8) might be better at explaining the relationship.
From the graph, we can see that high quality wines tend to have a higher alcohol level and lower volatile acidity than low quality wines. However, given the high variance, we should be careful about concluding any relationship. I will explore further with other variables.
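One way to put numbers on the caution about variance is to compare group means with group standard deviations. The sketch below uses synthetic data, and the means and spreads in it are invented for illustration, not estimates from the wine data; `quality.level` is a hypothetical name for the low/high variable.

```r
# Synthetic stand-in; the group means and spreads are made up.
set.seed(7)
wine <- data.frame(quality.level = rep(c("low", "high"), each = 100))
is_high <- wine$quality.level == "high"
wine$alcohol <- ifelse(is_high, 10.9, 9.9) + rnorm(200, sd = 1)
wine$volatile.acidity <- ifelse(is_high, 0.47, 0.59) + rnorm(200, sd = 0.17)

# Mean and standard deviation of each variable within each quality group.
group_means <- aggregate(cbind(alcohol, volatile.acidity) ~ quality.level,
                         data = wine, FUN = mean)
group_sds <- aggregate(cbind(alcohol, volatile.acidity) ~ quality.level,
                       data = wine, FUN = sd)

print(group_means)
print(group_sds)
```

If the gap between the group means is small relative to the standard deviations, the groups overlap heavily and the variable alone is a weak predictor of quality.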
Looking at the graph, it is hard to notice any relationship. The only thing I could notice is that high quality wines tend to have a higher alcohol level than low quality wines. I will explore further with other variables.
Looking at the graph, it is hard to notice any relationship. The only thing I could notice is that wines with a low alcohol level are more dispersed in sulphates level than wines with a high alcohol level. I will explore further with other variables.
Looking at the graph, it seems that low quality wines tend to have not only a lower alcohol level but also a lower citric acid level. I will explore further with other variables.
Looking at the graph, high quality wines tend to have lower volatile acidity and higher bound sulfur dioxide than low quality wines. I will explore further with other variables.
Looking at the graph, it is hard to notice a strong relationship, but high quality wines tend to have higher sulphates and lower volatile acidity than low quality wines. I will also add the alcohol level variable to the graph to observe the relationship.
Looking at the plot above, we can see that high quality wines tend to have a higher alcohol level, a higher sulphates level, and lower volatile acidity than low quality wines.
I will explore further with other variables.
Looking at the graph, it is hard to notice a strong relationship, but high quality wines tend to have higher citric acid and lower volatile acidity than low quality wines. I will explore further with other variables.
Looking at the graph, it is hard to notice a strong relationship, but high quality wines tend to have higher sulphates than low quality wines. I will explore further with other variables.
Looking at the graph, it is hard to notice a strong relationship, but high quality wines tend to have higher citric acid than low quality wines. I will explore further with other variables.
Looking at the graph, it is hard to notice a strong relationship, but high quality wines tend to have higher sulphates than low quality wines. I will explore further with other variables.
For most of the plots, it was hard to identify a strong relationship among variables. As was shown in the univariate analysis, alcohol seemed to have more influence on the quality of wine than the other variables.
The plot indicates that high quality wines tend to have a higher alcohol level than low quality wines.
Even though it is not evident, we can see that high quality wines tend to have higher sulphates and lower volatile acidity than low quality wines. However, it would be hard to predict a wine's quality just by looking at its sulphates and volatile acidity, due to the high variance.
Looking at the plot, we can see that high quality wines tend to have a higher alcohol level, higher sulphates, and lower volatile acidity than low quality wines. As mentioned before, the relationship is not strong enough to predict the quality of red wine based on its alcohol level, sulphates, and volatile acidity.
The red wine data set contains 1599 red wines with 12 variables. I explored the distribution of each variable and used bivariate plots to identify the relationship between each variable and the quality of red wine. I used boxplots to explore the relationship between quality and the other variables. The difficulty with boxplots was that there was no exact standard for concluding whether the relationship between quality and a variable was strong enough. Especially with a small amount of data, the variance was too large to identify the relationship. Despite these limitations, I still found alcohol, sulphates, and volatile acidity to have the most influence on the quality of red wine.
From this project, I realized that the quality of red wine, which is decided by wine experts, is too complex to be explained by those 12 given variables. Next time, I hope there will be more data and more variables to explore, such as the price of the wine. With the price and the selling records of each wine, we could analyze the difference in preference between wine experts and other drinkers. Furthermore, if we knew who buys which wine, we could see which age group tends to like wine with a high alcohol level.
I strongly feel that with more variables and data, there is much more we could explore.